Abstract. In this paper we study the relationship between a query and a
search engine by exploring the selective properties of a simple search
engine. We use set theory, with words and terms, to define singleton and
doubleton event spaces, and then apply them to prove the existence of the
shadow of a micro-cluster.
1 Introduction

Numerous studies in natural language processing (NLP) and the Semantic Web
utilize a search engine, mainly to obtain a set of documents that include a
given query and to get statistical information about an object, such as the
hit count of an entity name [1]. However, bringing NLP and the Semantic Web
to life as information processing services that provide knowledge, for
example ontology construction, knowledge extraction, question answering, and
other purposes, needs more effort. To describe the relationship between a
search engine and a query as a novel property, we have already defined some
instances of simple search engines: singleton [2] and doubleton [3] spaces.
This model is based on the simple architecture of a search engine for
representing a collection of documents in general, so that the model can
distinguish the features of Web documents, whereby any query can serve the
essential purpose of Information Retrieval [4] strategies: to meet user
needs. Therefore, this paper aims to address some properties based on
relations between singleton and doubleton in a triplet. We also provide the
basis of this model for the micro-cluster, for implementing the adaptive
properties.



2 Some Terminologies 

We defined some terminologies as follows |H2l3j . 
* A draft 
Definition 1. A term $t_k$ consists of at least one word or a set of words in
a pattern, i.e. $t_k = (w_1 w_2 \cdots w_l)$, $l \le k$, where $k$ is the
number of parameters representing word $w$, $l$ is the number of tokens
(vocabularies) in $t_k$, and $|t_k| = k$ is the size of $t_k$. ■

Definition 2. Let the set of web pages indexed by a search engine be
$\Omega$, i.e., a set containing ordered pairs of terms and web pages
$(t_{k_i}, \omega_{k_j})$, $i = 1, \dots, I$, $j = 1, \dots, J$. The relation
table that consists of two columns $t_k$ and $\omega_k$ is a representation
of $(t_{k_i}, \omega_{k_j})$, where
$\Omega_k = \{(t_k, \omega_k)_{ij}\} \subseteq \Omega$ or
$\Omega_k = \{\omega_{k_1}, \dots, \omega_{k_j}\}$. The cardinality of
$\Omega$ is denoted by $|\Omega|$. ■

Definition 3. Let $t_x$ be a search term, and $t_x \in \Sigma$, where
$\Sigma$ is the set of singleton search terms of a search engine. A vector
space $\Omega_x \subseteq \Omega$ is a singleton search engine event
(singleton space of event) of web pages that contain an occurrence of
$t_x \in \omega_x$. The cardinality of $\Omega_x$ is denoted by
$|\Omega_x|$. ■

Definition 4. Let $t_x$ and $t_y$ be two different search terms,
$t_x \ne t_y$, $t_x, t_y \in \Sigma$, where $\Sigma$ is the set of singleton
search terms of a search engine. A doubleton search term is
$\mathcal{D} = \{\{t_x, t_y\} : t_x, t_y \in \Sigma\}$, and its vector space,
denoted by $\Omega_x \cap \Omega_y$, is a double search engine event
(doubleton space of event) of web pages that contain a co-occurrence of
$t_x$ and $t_y$ such that $t_x, t_y \in \omega_x$ and
$t_x, t_y \in \omega_y$, where

Some adaptive properties are defined to identify efficient ways to access
information by using the simple search engine model. In general, the adaptive
properties adopt the meaning of singleton and doubleton in the following
equations

|^x| - \n x \ + Qy\ 

and 

\q x n S2 y \ = \Q X n Q y \ + \si x n Q x \ + \si y n si y \ 

Therefore, statistically, either the singleton or the doubleton contains
biased information.

Usually, to improve the quality of the statistical information that a search
engine returns for a given query, the count is processed statistically based
on the above properties. However, to make an additional improvement, we must
devote more attention to the results of the search engine and carefully
handle the count when developing the selective model.
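The singleton and doubleton event spaces of Definitions 3 and 4 can be pictured with a small inverted index. The following is only a minimal sketch under stated assumptions: terms are single lower-cased words, web pages are short strings, and the names `build_index`, `singleton`, and `doubleton` are hypothetical helpers that do not come from the paper.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in set(text.lower().split()):
            index[word].add(page_id)
    return index


def singleton(index, t_x):
    """Singleton event space: pages that contain an occurrence of t_x."""
    return index.get(t_x, set())


def doubleton(index, t_x, t_y):
    """Doubleton event space: pages that contain both t_x and t_y."""
    return singleton(index, t_x) & singleton(index, t_y)


# Toy collection of indexed pages (illustrative data only).
pages = {
    1: "semantic web search",
    2: "search engine query",
    3: "semantic search engine",
}
index = build_index(pages)
hits_search = len(singleton(index, "search"))            # cardinality of one singleton
hits_pair = len(doubleton(index, "semantic", "engine"))  # cardinality of one doubleton
```

In this toy collection the cardinalities play the role of hit counts; a real search engine only reports such counts approximately, which is the bias the section refers to.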

3 The Selective Properties 

The purpose of the selective properties is to construct an approach for
eliminating bias by using selected results of a simple search engine. One
kind of result returned by a search engine is as follows [5].

Definition 5. Let $t_x$ be a search term. $S = \{w_1, \dots, w_{max}\}$ is a
Web snippet (briefly, snippet), $S \subseteq \omega_x \in \Omega$, where
$max \le 50$ words to the left and right of $t_x$ are returned by any search
engine. $L = \{S_i : i = 1, \dots, n\}$ is a list of snippets.
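A snippet in the sense of Definition 5 can be imitated by cutting a window of words around the search term. This is a hedged sketch only: it assumes the window is taken on each side of the first occurrence of a single-word term, and the function names `snippet` and `snippet_list` are hypothetical; real search engines build snippets in their own undocumented ways.

```python
def snippet(tokens, term, max_words=50):
    """Return up to max_words tokens on each side of the first occurrence
    of term (the term itself included), or None if the term is absent."""
    if term not in tokens:
        return None
    pos = tokens.index(term)
    start = max(0, pos - max_words)
    return tokens[start:pos + max_words + 1]


def snippet_list(documents, term, max_words=50):
    """Build the list L of snippets for the documents that contain term."""
    result = []
    for text in documents:
        s = snippet(text.lower().split(), term, max_words)
        if s is not None:
            result.append(s)
    return result


# Illustrative documents; the second one does not contain the term.
docs = [
    "the simple search engine returns web snippets",
    "no match here",
]
L = snippet_list(docs, "engine", max_words=2)
```

With `max_words=2` the only snippet is the five-word window centred on "engine"; documents without the term contribute nothing to the list.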
3.1 Triplet

A relationship between search term, snippets, and words, constructed from the
frequency of words, is as follows [6].

Definition 6. A relationship between search term, web snippets, and words is
defined as the mixture $p(t_a, S, w) \sim t_a \times S \times w$,
$t_a \in \Sigma$, $S \in L \subseteq \Omega$, $w \in S$. A vector space of
$p(t_a, S, w)$ is defined as
$\mathbf{w} = \{w_1, \dots, w_j\} = [\nu_1, \dots, \nu_j]$,
$\nu_1 \ge \dots \ge \nu_j$, where $w_1, \dots, w_j$ are the unique words in
$S$ and $\nu_1, \dots, \nu_j$ are the weights of the words. ■
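The vector space of Definition 6 pairs each unique word with a weight and orders the weights from largest to smallest. The sketch below assumes, purely for illustration, that the weight of a word is its raw frequency over the list of snippets; the function name `word_vector` is hypothetical.

```python
from collections import Counter


def word_vector(snippets):
    """Return (words, weights): the unique words of the snippets and their
    frequencies, ordered by non-increasing weight (ties broken by word)."""
    counts = Counter(word for s in snippets for word in s)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    words = [w for w, _ in ranked]
    weights = [v for _, v in ranked]
    return words, weights


# Illustrative snippets, already tokenized.
snippets = [
    ["search", "engine", "query"],
    ["search", "engine"],
    ["search"],
]
words, weights = word_vector(snippets)
```

Any other weighting scheme could be substituted for the raw count as long as the resulting vector is sorted in the same non-increasing order.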

We call the relation among the search term, the Web snippets, and the words
a triplet, or write it as term-snippet-word. The triplet is a basis for
exploring the features of Web pages or Web documents. Feature exploration
describes an object literally in text if the purpose of the search term is
to explain
the object. A relation between term and snippet is, logically,
$|t_a \cap S| = 1$ if $t_a \in S$ and $= 0$ otherwise, and

$$P(t_a \cap S) = \frac{1}{max} \qquad (1)$$

or the probability of the search term in the list of snippets is

$$P(t_a \cap L) = \frac{n}{max}. \qquad (2)$$

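A relation of snippet-word is interpreted as follows,

$$P(S \cap w) = \sum_{j=1}^{m} \frac{1}{max} \qquad (3)$$

where $m$ is the number of occurrences of the same word in the vocabulary,
or the probability of the word in the list of snippets is as follows

$$P(L \cap w) = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{1}{max} \qquad (4)$$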

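while the term-word relation has two logical representations, i.e.
$|t_a \cap w| = \sum_{j=1}^{m} 1$ if $t_a \in S$ and $= 0$ if
$t_a \notin S$, or the probability of $t_a \cap w$ in $S$ is as follows

$$P(t_a \cap w)_S = \frac{\sum_{j=1}^{m} 1}{max}. \qquad (5)$$

The probability of $t_a \cap w$ in $L$ is

$$P(t_a \cap w)_L = \sum_{i=1}^{n} \frac{\sum_{j=1}^{m} 1}{max}
= \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{1}{max}. \qquad (6)$$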

Trivially, for each snippet there exists a set of words
$\mathbf{w} = \{w_i \mid i = 1, \dots, n\}$, i.e. $\mathbf{w}$ contains at
least one word of the search term or the search term itself.
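The snippet-word and term-word quantities of this subsection reduce to counting occurrences. The sketch below rests on an explicit assumption, namely the reading in which every occurrence of a word inside a snippet contributes 1/max and the contributions are summed over the snippets of the list; the function names are hypothetical and the values are illustrative only.

```python
def p_snippet_word(snip, word, max_words):
    """Snippet-word relation: m occurrences, each contributing 1/max."""
    m = snip.count(word)
    return m / max_words


def p_list_word(snippets, word, max_words):
    """Word in the list of snippets: sum of the per-snippet values."""
    return sum(p_snippet_word(s, word, max_words) for s in snippets)


def p_term_word_list(snippets, term, word, max_words):
    """Term-word relation over the list: only snippets that contain the
    search term contribute; the others count as 0."""
    return sum(
        p_snippet_word(s, word, max_words) for s in snippets if term in s
    )


# Illustrative snippets with max = 4 words per snippet (an assumption).
snippets = [
    ["search", "engine", "search"],
    ["engine", "query"],
]
p3 = p_snippet_word(snippets[0], "search", 4)
p4 = p_list_word(snippets, "engine", 4)
p6 = p_term_word_list(snippets, "search", "engine", 4)
```

Note that sums of this form are not normalized over the list, so they behave like scores rather than probabilities in the strict sense when a word occurs in many snippets.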
Definition 7. Word frequency is the number of occurrences of a unique word in
a set of words, i.e. $\exists \nu \in \Re$ $\forall w \in \mathbf{w}$, where
$\Re$ is the set of real numbers and $\nu \in \Re$ is the weight of a word.
Generally, there is a 1:1 function $w$ such that

$$w : \mathbf{w} \to \Re \qquad (7)$$

In this case, $\Re$ is a vector space of $\mathbf{w}$. ■

Lemma 1. If a set of words is a representation of the snippets in a list of
snippets, then the vector space of the set of words contains the probability
of a word in the snippets.

Proof. Eq. (3) is the probability of a word based on the frequency of the
word in a snippet, where $m$ is the number of occurrences of a unique word in
the snippet. Therefore, Eq. (4) is also the probability of a word based on
the frequency of the word in the list of snippets. This is reasonable because
$|t_a \cap w| \ne 0$. Based on the preceding equations, we obtain the
probability of a word in a list of snippets for a search term, and $\Re$
contains this value for all $w \in \mathbf{w}$. ■

Lemma 2. If a set of words is a representation of the snippets in a list of
snippets, then the set of words is an event space.

Proof. As a search term, each word in $\mathbf{w}$ has $\Omega_w$ by
Definition 3, and each pair of words has $\Omega_{w_i} \cap \Omega_{w_j}$ by
Definition 4. Thus $\mathbf{w}$ is an event space. ■

The event space contains vectors $\mu_i = |\Omega_{w_i}|$,
$i = 1, \dots, n$, and we call it the singleton event space.

Proposition 1. If $p(t_a, S, w)$ is a triplet for $t_a$, then there is at
least one vector space of $p(t_a, S, w)$.

Proof. A direct consequence of Lemma 1 and Lemma 2. ■

Thus, based on Lemma 1 and Lemma 2, there are two vector spaces for a list
of snippets, i.e.

Definition 8. A set of words $\mathbf{w}$ is a context if $\mathbf{w}$ has
two vector spaces that satisfy

1. for $[\nu_1, \dots, \nu_j]$ as vector space,
   $\nu_1 \ge \dots \ge \nu_j$, and

2. for $[\mu_1, \dots, \mu_j]$ as singleton vector space,
   $\mu_1 \ge \dots \ge \mu_j$. ■

3.2 Micro-cluster

In this part of the section, we define an undirected graph of words
$G = (V, E)$ to describe the relations between words [1].

Definition 9. Assume a sub-graph $G'$, $G' \subseteq G$. $G'$ is a
micro-cluster if it satisfies the following conditions:

1. There is a set of words $\mathbf{w} = \{w_1, \dots, w_j\}$ whose vector
   space is $[\nu_1, \dots, \nu_j]$ with
   $\nu_1 \ge \dots \ge \nu_j > \alpha$, where $\alpha$ is a threshold.

2. There is a one-one function $f : \mathbf{w} \to V$ such that $f(w) = v$,
   $\forall w \in \mathbf{w}$ $\exists v \in V$, where $v \in V$ is a vertex
   in $G'$.

3. There is a one-one function
   $\rho : \mathbf{w} \times \mathbf{w} \to E$ such that
   $\rho(w_i, w_j) = e$, $\forall w_i, w_j \in \mathbf{w}$, where $\rho$ is a
   relation among words and $e \in E$ is an edge in $G'$.

The micro-cluster is denoted by
$G' = (V, E, \mathbf{w}, f, \rho, \alpha)$. ■
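The three conditions of Definition 9 can be sketched as a small construction: filter words by the threshold, map each remaining word to a vertex, and map each pair of words to an edge. This is only an illustration under assumptions: the edge strength is taken from a hypothetical co-occurrence table (0 when a pair is absent), the threshold test is strict, and the name `micro_cluster` is not from the paper.

```python
from itertools import combinations


def micro_cluster(weights, cooccurrence, alpha):
    """weights: dict word -> weight.
    cooccurrence: dict frozenset({w_i, w_j}) -> relation strength.
    Returns (words, vertices, edges) for the words whose weight exceeds
    the threshold alpha."""
    # Condition 1: words above the threshold, in non-increasing weight order.
    words = sorted(
        (w for w, v in weights.items() if v > alpha),
        key=lambda w: (-weights[w], w),
    )
    # Condition 2: one-one mapping from words to vertex ids.
    vertices = {w: i for i, w in enumerate(words)}
    # Condition 3: every pair of words is mapped to one edge.
    edges = {}
    for wi, wj in combinations(words, 2):
        strength = cooccurrence.get(frozenset((wi, wj)), 0)
        edges[(vertices[wi], vertices[wj])] = strength
    return words, vertices, edges


# Illustrative weights and co-occurrence strengths.
weights = {"search": 5, "engine": 4, "web": 3, "query": 1}
co = {
    frozenset(("search", "engine")): 3,
    frozenset(("engine", "web")): 2,
}
words, vertices, edges = micro_cluster(weights, co, alpha=2)
```

Because every pair of retained words receives an edge, the resulting sub-graph is a clique over the retained words; pairs with no recorded co-occurrence simply carry strength 0.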

A micro-cluster is a maximal clique sub-graph of an entity name, where each
node represents a word that has the highest score in a document. However,
given a set of words $\mathbf{w}$ whose weights are above the threshold, the
collection of words does not necessarily refer to the same entity. To group
the words into the appropriate cluster, we construct trees of words. This is
based on the assumption that words appearing in the same domain have the
closest relations. The tree is an optimal representation of the relations in
graph $G$.

Definition 10. A tree $T$ is an optimal micro-cluster if and only if $T$ is a
sub-graph of the micro-cluster $G'$, and it is denoted by
$T = (V_T, E_T, \mathbf{w}_T, f, \rho, \alpha)$, where $V_T \subseteq V$,
$E_T \subseteq E$, and $\mathbf{w}_T \subseteq \mathbf{w}$. ■

In building the optimal micro-cluster, we save the strongest relations in $T$
between one word and another in $G'$ until $T$ has no cycle. A cycle is a
sequence of two or more edges
$(v_i, v_j), (v_j, v_k), \dots, (v_{k+1}, v_i) \in E$ such that there is an
optimal edge $(v_i, v_j) \in E$ that connects both ends of the sequence.
Suppose a word is introduced as an intrusive word about an entity, and at
least one word of the optimal micro-cluster has the strongest relation with
the entity; then the optimal micro-cluster is a group of words that refer to
that entity. However, overlapping keywords also exist in the same list. We
define a strategy to select the relevant keywords among all candidate lists.
In this case, there are a few potential keywords for identifying the entity
name.
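Keeping the strongest relations while refusing any edge that would close a cycle is what a maximum spanning tree computation does. The source does not name an algorithm, so the sketch below substitutes a Kruskal-style procedure with a union-find structure as one plausible reading; the name `optimal_micro_cluster` and the sample edges are hypothetical.

```python
def optimal_micro_cluster(n_vertices, edges):
    """edges: dict (u, v) -> relation strength between vertex ids.
    Returns the kept edges as (u, v, strength), strongest first, such
    that no cycle is formed (a maximum spanning forest)."""
    parent = list(range(n_vertices))

    def find(x):
        # Follow parent links to the component root, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    # Visit edges from strongest to weakest; ties are broken by vertex ids.
    ordered = sorted(edges.items(), key=lambda kv: (-kv[1], kv[0]))
    for (u, v), strength in ordered:
        ru, rv = find(u), find(v)
        if ru != rv:          # the edge joins two components: no cycle
            parent[ru] = rv
            tree.append((u, v, strength))
    return tree


# Illustrative clique on three vertices.
edges = {(0, 1): 3, (0, 2): 0, (1, 2): 2}
tree = optimal_micro_cluster(3, edges)
```

Here the weakest edge (0, 2) is the one dropped, since adding it after the two stronger edges would close a cycle.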

Definition 11. A vector space $s = [|w_1|, \dots, |w_j|]$ is a mirror shade
of micro-cluster $G'$ if there is a one-one function
$g : \mathbf{w} \to s$, where $w_1, \dots, w_j$ are in the event space. Let
$z$ be the element with the greatest value in $s$; the vector space in the
range $[0, 1]$ is relatively defined as
$s_{[0,1]} = [|w_1|/|z|, \dots, |w_j|/|z|] = [p_1, \dots, p_j]$. ■

We can also generate, for example, other vectors from
$\Omega_1, \dots, \Omega_j$ for words $w_1, \dots, w_j$ respectively, such
that $[p_1, \dots, p_j] = [\Omega_1, \dots, \Omega_j]$ is a mirror shade of
$[\nu_1, \dots, \nu_j]$ from a set of word frequencies.
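The relative vector of Definition 11 amounts to dividing every entry by the largest one. The sketch below assumes that the size attached to each word is a non-negative count, for example a singleton hit count; the name `mirror_shade` and the sample counts are hypothetical.

```python
def mirror_shade(sizes):
    """sizes: dict word -> non-negative size of the word in the event space.
    Returns dict word -> size divided by the greatest size, in [0, 1]."""
    if not sizes:
        return {}
    z = max(sizes.values())
    if z == 0:
        # All entries are zero, so the relative vector is all zeros too.
        return {w: 0.0 for w in sizes}
    return {w: c / z for w, c in sizes.items()}


# Illustrative sizes only.
shade = mirror_shade({"search": 200, "engine": 100, "query": 50})
```

The word with the greatest size always maps to 1.0, so two shades built from different sources (word frequencies versus hit counts) become directly comparable.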
Theorem 1. Let $s_T \subseteq s$, then $s_T$ is the mirror shade of an
optimal micro-cluster $T$.

Proof. Let $s_T \subseteq s$. Based on Definition 11 we have
$\mathbf{w}_T \subseteq \mathbf{w}$, i.e. $g(\mathbf{w}_T) = s_T$ or, because
$g$ is a one-one function, $g^{-1}(s_T) = \mathbf{w}_T \subseteq \mathbf{w}$.
Next, by applying Definition 9, $f(\mathbf{w}_T) = V_T$ or, because $f$ is a
one-one function, $f^{-1}(V_T) = \mathbf{w}_T \subseteq \mathbf{w}$, and
$s_T = g(\mathbf{w}_T) = g(f^{-1}(V_T)) = f^{-1} g(V_T)$, and we obtain
$\rho(\mathbf{w}, \mathbf{w}) = \rho(f^{-1}(V), f^{-1}(V)) \subseteq E$, so

$$\rho(s_T \times s_T) = \rho(g(\mathbf{w}_T) \times g(\mathbf{w}_T))
= \rho(g(f^{-1}(V_T)) \times g(f^{-1}(V_T)))
= \rho(f^{-1} g(V_T) \times f^{-1} g(V_T))
= f^{-1} g \rho(V_T \times V_T) = f^{-1} g(\rho(V_T \times V_T))$$

because $\rho$ is also a one-one function. This means that
$V_T \subseteq V$ has $s_T$ as a mirror shade of $\mathbf{w}_T$. ■

4 Conclusions and Future Work 

The selective properties have been derived from the singleton and doubleton
based on the triplet concept. Through these properties, the existence of the
shadow of any micro-cluster for the space of events has been proven. Our
future work concerns the relation between adaptive and selective properties
for exploring an overlap principle.